{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0d6aba5a-abde-4e41-8a54-ef7e0de9e8a3",
   "metadata": {},
   "source": [
    "# Query Pipeline over Pandas DataFrames\n",
    "\n",
    "This is a simple example that builds a query pipeline that can perform structured operations over a Pandas DataFrame to satisfy a user query, using LLMs to infer the set of operations.\n",
    "\n",
    "This can be treated as the \"from-scratch\" version of our `PandasQueryEngine`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "beea049c",
   "metadata": {},
   "outputs": [],
   "source": [
    "%pip install llama-index-llms-openai"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e16e7f46-6a3f-4390-b149-51d49a255d54",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.core.query_pipeline import (\n",
    "    QueryPipeline as QP,\n",
    "    Link,\n",
    "    InputComponent,\n",
    ")\n",
    "from llama_index.core.query_engine.pandas import PandasInstructionParser\n",
    "from llama_index.llms.openai import OpenAI\n",
    "from llama_index.core import PromptTemplate"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "844f1db7-8eaa-4fb2-91f4-a363db66bd5b",
   "metadata": {},
   "source": [
    "## Download Data\n",
    "\n",
    "Here we load the Titanic CSV dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "31c3d6f4-ec01-4992-8dbf-696b6feeda5f",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "--2024-01-13 18:39:07--  https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/csv/titanic_train.csv\n",
      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 2606:50c0:8003::154, 2606:50c0:8001::154, 2606:50c0:8002::154, ...\n",
      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|2606:50c0:8003::154|:443... connected.\n",
      "HTTP request sent, awaiting response... 200 OK\n",
      "Length: 57726 (56K) [text/plain]\n",
      "Saving to: ‘titanic_train.csv’\n",
      "\n",
      "titanic_train.csv   100%[===================>]  56.37K  --.-KB/s    in 0.007s  \n",
      "\n",
      "2024-01-13 18:39:07 (7.93 MB/s) - ‘titanic_train.csv’ saved [57726/57726]\n",
      "\n"
     ]
    }
   ],
   "source": [
    "!wget 'https://raw.githubusercontent.com/jerryjliu/llama_index/main/docs/examples/data/csv/titanic_train.csv' -O 'titanic_train.csv'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b2f94d45-607c-4f04-a05c-b66ecedfe233",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "df = pd.read_csv(\"./titanic_train.csv\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "b42ccfe1-fd13-4356-b5b1-c9eabac5f391",
   "metadata": {},
   "source": [
    "## Define Modules\n",
    "\n",
    "Here we define the set of modules: \n",
    "1. Pandas prompt to infer pandas instructions from user query\n",
    "2. Pandas output parser to execute pandas instructions on dataframe, get back dataframe\n",
    "3. Response synthesis prompt to synthesize a final response given the dataframe\n",
    "4. LLM\n",
    "\n",
    "The pandas output parser specifically is designed to safely execute Python code. It includes a lot of safety checks that may be annoying to write from scratch. This includes only importing from a set of approved modules (e.g. no modules that would alter the file system like `os`), and also making sure that no private/dunder methods are being called."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d47ac723-e1fc-47c2-9736-19deb50d87ae",
   "metadata": {},
   "outputs": [],
   "source": [
    "instruction_str = (\n",
    "    \"1. Convert the query to executable Python code using Pandas.\\n\"\n",
    "    \"2. The final line of code should be a Python expression that can be called with the `eval()` function.\\n\"\n",
    "    \"3. The code should represent a solution to the query.\\n\"\n",
    "    \"4. PRINT ONLY THE EXPRESSION.\\n\"\n",
    "    \"5. Do not quote the expression.\\n\"\n",
    ")\n",
    "\n",
    "pandas_prompt_str = (\n",
    "    \"You are working with a pandas dataframe in Python.\\n\"\n",
    "    \"The name of the dataframe is `df`.\\n\"\n",
    "    \"This is the result of `print(df.head())`:\\n\"\n",
    "    \"{df_str}\\n\\n\"\n",
    "    \"Follow these instructions:\\n\"\n",
    "    \"{instruction_str}\\n\"\n",
    "    \"Query: {query_str}\\n\\n\"\n",
    "    \"Expression:\"\n",
    ")\n",
    "response_synthesis_prompt_str = (\n",
    "    \"Given an input question, synthesize a response from the query results.\\n\"\n",
    "    \"Query: {query_str}\\n\\n\"\n",
    "    \"Pandas Instructions (optional):\\n{pandas_instructions}\\n\\n\"\n",
    "    \"Pandas Output: {pandas_output}\\n\\n\"\n",
    "    \"Response: \"\n",
    ")\n",
    "\n",
    "pandas_prompt = PromptTemplate(pandas_prompt_str).partial_format(\n",
    "    instruction_str=instruction_str, df_str=df.head(5)\n",
    ")\n",
    "pandas_output_parser = PandasInstructionParser(df)\n",
    "response_synthesis_prompt = PromptTemplate(response_synthesis_prompt_str)\n",
    "llm = OpenAI(model=\"gpt-3.5-turbo\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1b7907f6-2926-4c83-9541-d089d0b0af88",
   "metadata": {},
   "source": [
    "## Build Query Pipeline\n",
    "\n",
    "Looks like this:\n",
    "input query_str -> pandas_prompt -> llm1 -> pandas_output_parser -> response_synthesis_prompt -> llm2\n",
    "\n",
    "Additional connections to response_synthesis_prompt: llm1 -> pandas_instructions, and pandas_output_parser -> pandas_output."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3544a6a8-3866-4806-8108-d908bd5d17e6",
   "metadata": {},
   "outputs": [],
   "source": [
    "qp = QP(\n",
    "    modules={\n",
    "        \"input\": InputComponent(),\n",
    "        \"pandas_prompt\": pandas_prompt,\n",
    "        \"llm1\": llm,\n",
    "        \"pandas_output_parser\": pandas_output_parser,\n",
    "        \"response_synthesis_prompt\": response_synthesis_prompt,\n",
    "        \"llm2\": llm,\n",
    "    },\n",
    "    verbose=True,\n",
    ")\n",
    "qp.add_chain([\"input\", \"pandas_prompt\", \"llm1\", \"pandas_output_parser\"])\n",
    "qp.add_links(\n",
    "    [\n",
    "        Link(\"input\", \"response_synthesis_prompt\", dest_key=\"query_str\"),\n",
    "        Link(\n",
    "            \"llm1\", \"response_synthesis_prompt\", dest_key=\"pandas_instructions\"\n",
    "        ),\n",
    "        Link(\n",
    "            \"pandas_output_parser\",\n",
    "            \"response_synthesis_prompt\",\n",
    "            dest_key=\"pandas_output\",\n",
    "        ),\n",
    "    ]\n",
    ")\n",
    "# add link from response synthesis prompt to llm2\n",
    "qp.add_link(\"response_synthesis_prompt\", \"llm2\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "dea9cba7-733b-4087-a297-5c2799a8cfe5",
   "metadata": {},
   "source": [
    "## Run Query"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "49237e3f-45f8-4b9b-8e32-0d31555b3539",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\u001b[1;3;38;2;155;135;227m> Running module input with input: \n",
      "query_str: What is the correlation between survival and age?\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module pandas_prompt with input: \n",
      "query_str: What is the correlation between survival and age?\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module llm1 with input: \n",
      "messages: You are working with a pandas dataframe in Python.\n",
      "The name of the dataframe is `df`.\n",
      "This is the result of `print(df.head())`:\n",
      "   survived  pclass                                               name  ...\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module pandas_output_parser with input: \n",
      "input: assistant: df['survived'].corr(df['age'])\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module response_synthesis_prompt with input: \n",
      "query_str: What is the correlation between survival and age?\n",
      "pandas_instructions: assistant: df['survived'].corr(df['age'])\n",
      "pandas_output: -0.07722109457217755\n",
      "\n",
      "\u001b[0m\u001b[1;3;38;2;155;135;227m> Running module llm2 with input: \n",
      "messages: Given an input question, synthesize a response from the query results.\n",
      "Query: What is the correlation between survival and age?\n",
      "\n",
      "Pandas Instructions (optional):\n",
      "df['survived'].corr(df['age'])\n",
      "\n",
      "Pandas ...\n",
      "\n",
      "\u001b[0m"
     ]
    }
   ],
   "source": [
    "response = qp.run(\n",
    "    query_str=\"What is the correlation between survival and age?\",\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "17345c60-65ec-4053-b992-cc1f790f2b55",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The correlation between survival and age is -0.0772. This indicates a weak negative correlation, suggesting that as age increases, the likelihood of survival slightly decreases.\n"
     ]
    }
   ],
   "source": [
    "print(response.message.content)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "llama_index_v2",
   "language": "python",
   "name": "llama_index_v2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
